ABSTRACT: A plan to update an organization's data warehouse must begin
with a clear assessment of the objectives of the data repository and a firm 
grasp of the relevant business objectives. Elements of data warehouse 
implementation include drawing data from a range of sources, managing and 
storing the data, providing access to the end-user, providing for 
multi-dimensional analysis of the data and providing attractive 
presentational options. A key issue is whether or not the relational 
database (RDBMS) will have to be replaced with a Multi Dimensional DBMS 
system specifically designed for data warehouse applications. Most RDBMS
systems are only able to accomplish On-Line Transactional Processing (OLTP) 
operations, while data warehousing requires On-Line Analytical Processing 
(OLAP). Users of a data warehouse can be categorized as power users,
executive users and casual users. Guidelines for the selection of a 
commercial package include cost, function, performance level, scalability, 
open systems capability and vendor reliability. 

TEXT: 

When implementing a data warehouse project, you need to first define 
the objectives and decision criteria before evaluating the alternatives and 
developing the plan. 

Your executives have read the articles in the business journals and 
are convinced that a data warehouse has the potential of significantly 
improving their competitive position, and they want one. You know that a 
data warehouse system is developed, not purchased "off the shelf." As a key 
member of the team assigned to plan and implement the data warehouse, you 
have many implementation questions. This article identifies the 
implementation considerations and discusses the decision criteria. 

A data warehouse is the repository that contains all the data of 
potential interest to the firm. This purposefully broad definition suggests 
that almost all data may be stored, accessed, analyzed and presented by 
someone, sometime. It may contain current operational data summarized along 
several dimensions, historical data and metadata. There may be industrywide 
economic data or even unstructured data. Some firms manage petabytes
(10^15 bytes) of data.

Data mining is the process used to analyze and summarize data from 
many perspectives to determine patterns so users can obtain relevant 
information necessary and appropriate to their questions. 

Define Goals And Prioritize Objectives 

As with any project, before you begin a data warehouse project there 
must be a clear description and agreement of the goals and objectives. Any 
information system should exist solely to support the firm's business 
objectives. Therefore, it is critical that management clearly define its 
near-term and long-term goals. 

The obvious question is: What is the relative importance of these 
objectives? While management may indicate that all are equally important, 
project planners and developers need to know the tradeoffs and priorities. 



Often, near-term objectives and priorities are different from 
long-term ones. The short-term objective may be to develop a successful 
pilot application using Rapid Application Development (RAD) techniques. 
This pilot project can then be used to demonstrate the benefits of a data 
warehouse and to learn from the pilot, refine the techniques and extend 
them to other functions and users. 

There are five major elements involved in implementing a data 
warehouse: 

1. Extract, transform and load the data warehouse from the various 
sources 

2. Store and manage the data warehouse 

3. End-user access 

4. Data analysis from different perspectives 

5. Presentation of the results. 

When loading the data warehouse, the source data usually must be
"cleaned up," extracted, transformed and denormalized before it can be
loaded into the data warehouse. The source data may be operational data and
generation data sets in the RDBMS's format, hierarchical databases and flat
files such as EBCDIC and ASCII files. The data warehouse may contain
unstructured data in electronic mail, notes and trade journal articles as
well as external data from public sources such as the Internet, Dun &
Bradstreet files, Department of Commerce statistics, etc. You can write
programs to load the data warehouse or purchase one of many vendor
products.
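As a rough illustration of the load step described above, the following Python sketch decodes a fixed-width EBCDIC flat file, parses an ASCII extract, cleans the values and denormalizes the result into one wide row per order. The record layouts, field names and sample values are hypothetical, and cp037 is only one of several EBCDIC code pages; a real load program would follow the layouts of the actual source systems.

```python
import csv
import io

# Hypothetical source 1: fixed-width EBCDIC records (4-char customer id, then name).
# cp037 is Python's built-in codec for one common EBCDIC code page.
ebcdic_records = [
    "0001ACME CORP ".encode("cp037"),
    "0002 bolt & nut".encode("cp037"),
]

# Hypothetical source 2: an ASCII comma-separated extract of orders.
ascii_orders = (
    "order_id,cust_id,amount\n"
    "10,0001,250.00\n"
    "11,0002,99.50\n"
    "12,0001,40.00\n"
)


def extract_customers(records):
    """Decode EBCDIC bytes and split the fixed-width fields."""
    customers = {}
    for raw in records:
        text = raw.decode("cp037")
        cust_id, name = text[:4], text[4:]
        # "Clean up": trim padding and normalize capitalization.
        customers[cust_id] = name.strip().upper()
    return customers


def extract_orders(text):
    """Parse the ASCII extract and convert the fields to typed values."""
    reader = csv.DictReader(io.StringIO(text))
    return [
        {
            "order_id": int(r["order_id"]),
            "cust_id": r["cust_id"],
            "amount": float(r["amount"]),
        }
        for r in reader
    ]


def denormalize(orders, customers):
    """Copy the customer name onto every order row (one wide row per order)."""
    return [dict(o, customer_name=customers[o["cust_id"]]) for o in orders]


warehouse_rows = denormalize(
    extract_orders(ascii_orders), extract_customers(ebcdic_records)
)
for row in warehouse_rows:
    print(row)
```

Repeating the customer name on every order row trades storage for simpler, join-free queries, which is the usual motive for denormalizing warehouse data.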

If the organization already has an RDBMS, a key question is whether 
the data in the data warehouse should be stored and managed by the same 
RDBMS or whether a MultiDimensional DBMS (MD-DBMS) specifically designed
for data warehouse and data mining should be used. The advantages of using
the existing RDBMS include the fact that the staff is already familiar with 
it and the infrastructure and procedures are already in place. However, a 
key disadvantage is that these RDBMS products are designed for On-Line 
Transactional Processing (OLTP) and not for On-Line Analytical Processing 
(OLAP) of the data warehouse. An RDBMS generally can manage only structured 
data, and OLAP has significantly different characteristics from OLTP. OLAP 
systems must be able to process tens and even hundreds of gigabytes of data 
with reasonable and consistent performance and provide multiple views. Many 
contend that existing RDBMS products cannot effectively satisfy 
multidimensional OLAP requirements. 

The advantages of using an MD-DBMS include the ability to more 
efficiently handle large data amounts through better data storage and 
indexing techniques. Some are specifically architected to exploit the power 
of Massively Parallel Processors (MPPs). The disadvantages include having
to develop the infrastructure to learn, install and support another DBMS 
(manuals, education, training, etc.). You also may have another vendor, 
with the resulting interface complexities. 

The access and presentation functions can be adequately satisfied by 
any number of excellent products with a GUI. Unfortunately, there are few
such products available that execute on the mainframe; thus, many data
warehouse systems are implemented in part on a PC or in a client/server
environment.

Many data mining products perform the basic operations necessary for 
multidimensional OLAP summarization and analysis (slice and dice, 
consolidation, drill down and basic analysis). A few products provide 
statistical functions and even artificial intelligence capabilities for 
more sophisticated analysis. 
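The basic operations named above can be sketched in a few lines of Python. This is only an illustration on an in-memory list: the sales figures, the dimension names and the helper functions consolidate and slice_ are hypothetical and do not represent any vendor's interface.

```python
from collections import defaultdict

# Hypothetical fact rows: (region, product, quarter, sales).
facts = [
    ("East", "Widgets", "Q1", 100),
    ("East", "Widgets", "Q2", 120),
    ("East", "Gadgets", "Q1", 80),
    ("West", "Widgets", "Q1", 90),
    ("West", "Gadgets", "Q2", 60),
]
DIMENSIONS = ("region", "product", "quarter")


def consolidate(rows, by):
    """Roll sales up to the named dimensions (consolidation)."""
    idx = [DIMENSIONS.index(d) for d in by]
    totals = defaultdict(int)
    for row in rows:
        totals[tuple(row[i] for i in idx)] += row[3]
    return dict(totals)


def slice_(rows, dimension, value):
    """Keep only the rows where one dimension has a fixed value (slice)."""
    i = DIMENSIONS.index(dimension)
    return [row for row in rows if row[i] == value]


# Consolidation: total sales by region.
by_region = consolidate(facts, ["region"])

# Drill down: open the East total into product and quarter detail.
east_detail = consolidate(slice_(facts, "region", "East"), ["product", "quarter"])

# Slice and dice: fix the quarter, then view the result by product.
q1_by_product = consolidate(slice_(facts, "quarter", "Q1"), ["product"])

print(by_region)
print(east_detail)
print(q1_by_product)
```

A multidimensional product would typically precompute or index such roll-ups rather than scan every row on each request.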

User Classes And Requirements 

One of the first considerations is to categorize users and determine 
their requirements. One useful way is to group users into three broad
categories on the basis of their different interests and requirements: 

* Power users 

* Executives 

* Casual users. 

When implementing a data warehouse, the challenge is to anticipate and 
satisfy the needs of these different users (see Table 1). Since many users 
may already be familiar with different end-user products, data mining tools 
should have interfaces to popular query and report writer products and 
spreadsheets, and provide exits to the available mathematical and 
statistical libraries. This imposes greater support requirements on the 
help desk. 

Preferably, business professionals and/or executives should not 
require the assistance of an I/S professional. Users should be able to 
perform analysis in a manner consistent with their professions. For 
example, a financial analyst might use terms such as "calculate the yield 
to maturity," or "what is the internal rate of return"; an accountant might 
use terms such as "net profit," "earnings per share" or "pro-forma"; a
sales manager might make a request to "compare the sales performance of 
each salesperson across all regions" or "compare the last 12 months of 
sales of the new product line with the sales of the original line," etc. 

Evaluation Considerations 

When selecting a data warehouse product or vendor, the planning team 
should develop evaluation criteria and weights before evaluating and
selecting any product or vendor. These include: 

* Economic 

* Required function 

* Usability 

* Desired level of performance 

* Scalability

* Openness 

* Hardware/software platform environment: 

* relative advantages of mainframes, client/server, PC 

* use existing software or install new? 

* Vendor considerations: 

* best-of-breed components or single vendor? 

* vendor viability and support (education, consulting). 

A simple 1 to 5 evaluation scale is usually used. Identifying all the
relevant factors results in better accuracy, which is more important than
precision. The criteria and weights should reflect your specific business 
environment and requirements. Any composite score should be used to 
indicate the relative, not absolute, evaluation. 
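A weighted evaluation of this kind can be sketched as follows. The weights, the two candidate products and their 1 to 5 ratings are entirely hypothetical and serve only to show the arithmetic; the criterion names abbreviate the list above.

```python
# Hypothetical weights for the evaluation criteria; they sum to 1.0.
weights = {
    "economic": 0.25,
    "function": 0.20,
    "usability": 0.15,
    "performance": 0.15,
    "scalability": 0.10,
    "openness": 0.05,
    "vendor": 0.10,
}

# Hypothetical 1-to-5 ratings for two made-up candidates.
ratings = {
    "Product A": {"economic": 3, "function": 5, "usability": 4, "performance": 4,
                  "scalability": 3, "openness": 2, "vendor": 4},
    "Product B": {"economic": 4, "function": 3, "usability": 3, "performance": 3,
                  "scalability": 4, "openness": 5, "vendor": 3},
}


def composite_score(rating, weights):
    """Weighted sum of the 1-to-5 ratings for one candidate."""
    for criterion in weights:
        if not 1 <= rating[criterion] <= 5:
            raise ValueError(f"rating for {criterion} is outside the 1 to 5 scale")
    return sum(weights[c] * rating[c] for c in weights)


scores = {name: composite_score(r, weights) for name, r in ratings.items()}

# Only the ordering is meaningful; the scores are relative, not absolute.
ranking = sorted(scores, key=scores.get, reverse=True)

for name in ranking:
    print(f"{name}: {scores[name]:.2f}")
```

Changing the weights can reverse the ranking, which is why the weights should be agreed upon before any product is rated.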

An OLAP system is usually on a separate hardware system from the OLTP 
system, so it does not impact the operational and tactical needs of the 
organization. 

An important consideration is to decide on the hardware and software 
platforms. The primary focus of this article is on products that execute 
either on an MPP or mainframe, or cooperate with a mainframe in a 
client/server environment. Of less interest are those products that execute 
only on stand-alone personal computers or Intelligent Work Stations 
(IWSes). Often, the extracted data and/or interim results are transferred 
to an IWS for subsequent analysis by the user. 

Table 2 is a summary of the strengths of various hardware and software 
platforms. These considerations now will be "sliced and diced" and "drilled
down."

Economic considerations should include the explicit hardware, software,
network and support costs and benefits, and the hidden or intangible
costs. Many firms do not adequately consider the intangible economic costs,
such as lost productivity, that occur when each user must act as a system 
administrator. Recent studies suggest that the total cost of a PC-centric 
or client/server system is much higher than initially thought. In some 
instances, it is five times greater than the cost of a mainframe-based 
system. The intangible benefits include the value of better, more timely 
information. 

Price is a consideration, but more important is the relative value or 
ratio of achievable benefits to price. Users realize the importance of 
timely, useful information and are willing to pay for value received. The 
total cost of the hardware, prerequisite and OLAP software and the 
opportunity cost of late or inaccurate information should be included in 
the justification of the system. 

Users should be provided with choices regarding the level of accuracy, 
desired precision, format of the query, statistical analysis techniques, 
level of sophistication of the model, resulting output and format. Examples 
include choices regarding graphical displays (line, bar, stack, pie, 
three-dimensional), colors and shading, equations, textual reports and 
spreadsheets.

Accuracy is not the same as precision; it is more important that users 
be provided with results that are accurate enough for the intended use than 
that they be extremely precise. For example, when preparing next year's 
budget, it is preferable that the manager know that the average salary is 
approximately $40,000 than to be told that it is $41,123.48, plus or minus 
$10,000. 

Simplicity and intuitive end-user access and usability are key 
requirements for all users. Minimally, the products should provide a GUI 
with windows, menus, icons, tool bars, etc., and use nonprocedural SQL 
language. 

Because the product may operate across multiple hardware/software 
platforms, the user interface should be consistent across all supported 
environments to minimize any education and retraining. Using the product 
should be intuitive and self-explanatory. There should be on-line, 
context-driven help that allows the user to obtain information and guidance 
regarding a topic by placing the cursor on the subject. When the user strikes
a key, a "pop-down" window appears with increasing levels of explanation of the
topic. The user does not have to key in the subject; the system 
automatically provides information for the topic indicated by the cursor. 

As with OLTP, the response time for OLAP depends on the hardware and
software platforms, the nature of the analysis and the load on the system.
The response-time objective for OLAP can be several orders of magnitude
longer than for OLTP. If the normal OLTP response time on a platform is n seconds, the
following guidelines for the OLAP response time are proposed: 

* Simple drill down and summarization - 5 to 10 n

* Statistical and regression analysis, mathematical analysis (e.g., 
curve fitting) - 10 to 50 n 

* Artificial Intelligence (AI) - depending on the technique, this
could take many minutes. In most instances, the AI tool will initiate a
task to be executed in a batch partition.

Scalability is the property that
provides support for additional users, larger databases and higher
performance by adding more computer resources (more storage capacity,
processing power, terminals, etc.) without changing the fundamental
operating environment, application or operating procedures. Doubling the
power should allow doubling the number of users with no degradation in
throughput or performance; doubling resources should provide the same
number of users with twice the performance.
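Returning to the response-time guidelines proposed above, the multiples of n can be turned into target ranges with a small calculation. The sketch below reads the first guideline as 5 to 10 n; the 2-second OLTP response time and the names used are hypothetical. The AI case is left out because the article gives no multiple for it.

```python
# Multiples of the normal OLTP response time n, taken from the guidelines.
OLAP_GUIDELINES = {
    "drill_down_summarization": (5, 10),
    "statistical_analysis": (10, 50),
}


def olap_response_range(oltp_seconds, analysis):
    """Return the (low, high) target OLAP response time in seconds."""
    low, high = OLAP_GUIDELINES[analysis]
    return (low * oltp_seconds, high * oltp_seconds)


n = 2  # hypothetical normal OLTP response time, in seconds
drill_down_target = olap_response_range(n, "drill_down_summarization")
statistical_target = olap_response_range(n, "statistical_analysis")

print("Drill down and summarization:", drill_down_target)
print("Statistical analysis:", statistical_target)
```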

Scalability can be achieved vertically (adding resources to a single
processor) or horizontally (by enabling multiple processors to
cooperatively operate and share the workload transparently). Typically, 
systems are scaled using: 

* Larger uniprocessors 

* Tightly coupled multiprocessors sharing memory and DASD 

* A loosely coupled multiprocessor that may share DASD but have 
separate processors or main memory 

* MPPs with hundreds of processors. 

Openness is the ability to interface with other vendors' products and 
other hardware and software platforms. The data should be importable from 
and exportable to the popular word processing, database and spreadsheet 
products.

In a concurrent multiuser access and usability environment, data 
integrity, data security and privacy are critically important. The 
organization needs to balance usability and accessibility with the need to 
protect the fundamentally valuable corporate data asset against 
unauthorized access. Recent articles have discussed instances where 
businesses incurred billions of dollars in lost or damaged data because of 
inadequate data access security protection against malicious actions, fraud 
and computer viruses. 

When building a data warehouse, you may select products from the same 
vendor (or from its business partners) or you may select "best-of-breed"
products from several vendors and integrate them into a cohesive system. 
With the latter, you can be the systems integrator or rely on one of the 
many consulting firms that offer integration, installation and turnkey
services. If data mining system products are selected from multiple
vendors, then they must be integrated to work together (not a trivial
task).

The "staying power" and viability of a vendor are as important as the
functional characteristics of the product. Recently, there have been 
numerous consolidations, restructuring and mergers in I/T. This has 
resulted in many software products being stabilized or no longer being 
marketed and supported. Many of the most innovative products are provided 
by smaller vendors. If you are to make the investment in data warehouse 
software and tools, you need assurance that the products will continue to 
be supported, enhanced and improved. This commitment is demonstrated by 
code quality, frequency of releases, the existence of user groups and the 
vendor's ability to provide technical, educational, installation and 
international service support. Other possible indicators include the 
vendor's level of sales and number of technical employees, years in 
business, sales offices, customer installations, etc. 

Conclusion 

When implementing a data warehouse project, you need to first define 
the objectives and decision criteria before evaluating the alternatives and 
developing the plan. 

Table 1

Typical User Characteristics

Characteristic     Power Users                Executives                       Casual Users

Level of Function  Sophisticated              Basic                            Basic
Response Time      Medium                     Quick                            Medium to Low
Processing Demand  High                       Medium to Low                    Low
Usability Needs    Low                        High                             Medium
Accuracy           High                       High                             High
Precision          High                       Medium to Low                    High
Openness           High                       Medium to Low                    Medium to Low
Analysis           Ad hoc, diverse, detailed  Summary, strategic, some detail  Basic, preplanned

Table 2 

Summary Of Relative Strengths Of Hardware/Software Platforms 

                            MPP        Mainframe  Client/Server  PC

Data Storage Capacity, I/O  High       High       Med            Low
Access-Usability            Low        Low        High           High
Presentation                Low        Low        High           High
I/S Services                High       High       Med            Low
# Users                     Very High  High       Med            Low
Analysis                    High       High       High           High
Scalability                 High       High       Med            Low

MPP and mainframes provide better storage and I/O 

PCs provide better access, usability and presentation 

Emil T. Cipolla has more than 30 years of experience in developing
large-scale mainframe information systems. He can be reached at
102127.2451@compuserve.com.
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ABSTRACT: A plan to update an organization's data warehouse must begin 
with a clear assessment of the objectives of the data repository and a firm 
grasp of the relevant business objectives. Elements of data warehouse 
implementation include drawing data from a range of sources, managing and 
storing the data, providing access to the end-user, providing for 
multi-dimensional analysis of the data and providing attractive 
presentational options. A key issue is whether or not the relational 
database (RDBMS) will have to be replaced with a Multi Dimensional DBMS 
system specifically designed for data warehouse applications. Most RDBMS 
systems are only able to accomplish On-Line Transactional Processing (OLTP) 
operations, while data warehousing requires On-Line Analytical Processing 
(OLAP). Users of a data warehouse can be categorized as power users, 
executive users and casual users .-. Guidelines for the selection of a 
commercial package include cost, function , performance level, scalability, 
open systems capability and vendor reliability. 

TEXT: 

When implementing a data warehouse project, you need to first define 
the objectives and decision criteria before evaluating the alternatives and 
developing the plan. 

Your executives have read the articles in the business journals and 
are convinced that a data warehouse has the potential of significantly 
improving their competitive position, and they want one. You know that a 
data warehouse system is developed, not purchased "off the shelf." As a key 
member of the team assigned to plan and implement the data warehouse, you 
have many implementation questions. This article identifies the 
implementation conside rati ons and discusses the decision criteria. 
{A-datlF^^ 

potential^ purposefully broad definition suggests 

that almost all data may be stored, accessed, analyzed and presented by 
someone, sometime. It may contain current operational data summarized along 
several dimensions, historical data and metadata. There may be industrywide 
economic data or even unstructured data. Some firms manage pedabytes 
(10. sup. 15) of data. 

Data mining is the process used to analyze and summarize data from 
many perspectives to determine patterns so users can obtain relevant 
information necessary and appropriate to their questions. 

Define Goals And Prioritize Objectives 

As with any project, before you begin a data warehouse project there 
must be a clear description and agreement of the goals and objectives. Any 
information system should exist solely to support the firm's business 
objectives. Therefore, it is critical that management clearly define its 
near-term and long-term goals. 

The obvious question is: What is the relative importance of these 
objectives? While management may indicate that all are equally important, 
project planners and developers need to know the tradeoffs and priorities. 



Often, near-term objectives and priorities are different from 
long-term ones. The short-term objective may be to develop a successful 
pilot application using Rapid Application Development (RAD) techniques. 
This pilot project can then be used to demonstrate the benefits of a data 
warehouse and to learn from the pilot, refine the techniques and extend 
them to other functions and users. 

There are five major elements involved in implementing a data 
warehouse: 

1. Extract, transform and load the data warehouse from the various 
sources 

2. Store and manage the data warehouse 

3. End-user access 

4. Data analysis from different perspectives 

5. Presentation of the results.. 

When loading the data Warehouse, the source data usually must be 
"cleaned up," extracted, transformed and denormalized before it can be 
loaded into the data warehouse. The source data may be operational data and 
generation data sets in the RDBMS ' format, hierarchical databases and flat 
files such as EBCDIC and ASCII files. The data warehouse may contain 
unstructured data in electronic mail, notes and trade journal articles as 
well as external data in public databases, such as the Internet, Dun & 
Bradstreet files, Department of Commerce Statistics, etc. You can write 
programs to load the data warehouse or purchase one of many vendor 
products. 

If the organization already has an RDBMS, a key question is whether 
the data in the data warehouse should be stored and managed by the same 
RDBMS or whether a Mul tiDimensional DBMS (MD-DBMS) specifically designed 
for data warehouse and dam mining should be used. The advantages of using 
the existing RDBMS include the fact that the staff is already familiar with 
it and the infrastructure and procedures are already in place. However, a 
key disadvantage is that these RDBMS products are designed for On-Line 
Transactional Processing (OLTP) and not for On-Line Analytical Processing 
(OLAP) of the data warehouse. An RDBMS generally can manage only structured 
data, and OLAP has significantly different characteristics from OLTP. OLAP 
systems must be able to process tens and even hundreds of gigabytes of data 
with reasonable and consistent performance and provide multiple views. Many 
contend that existing RDBMS products cannot effectively satisfy 
multidimensional OLAP requirements. 

The advantages of using an MD-DBMS include the ability to more 
effici^n^y handle large data amounts through better data storage and 
^nd^ing^techn-iqu^. Some are specifically architected to exploit the power 
of Massively Parallel Processors (MPPs) . The disadvantages include having 
to develop the infrastructure to learn, install and support another DBMS 
(manuals, education, training, etc.). You also may have another vendor, 
with the resulting interface complexities. 

The access and presentat4on-f unc-t-ions can be adequately satisfied by 
any number of excellent-products w-lth-a-^GtlT. Unfortunately, there are few 
such products available that execute on the mainframe; thus, many data 
warehouse systems are implemented in part on a PC or in a client/server 
environment. 

Many data mining products perform the basic operations necessary for 
multidimensional OLAP summarization and analysis (slice and dice, 
consolidation, drill down and basic analysis). A few products provide 
statistical functions and even artificial intelligence capabilities for 
more sophisticated analysis. 

User Classes And Requirements 

One of the first considerations is to categorize users and determine 
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their requirements. One useful way is to group users into three broad 
categories on the basis of their different interests and requirements: 

* Power users 

* Executives 

* Casual users. 

When implementing a data warehouse, the challenge is to anticipate and 
satisfy the needs of these different users (see Table 1). Since many users 
may already be familiar with different end-user produ c t s , c d at a^m : j:n i rrgrit oo:>s 
shouldzzhave^nt^ -products anH 

spreadsheets, and provide exits to the available mathematical and 
s"tatiYtical libraries. This imposes greater support requirements on the 
help desk. 

Preferably, business professionals and/or executives should not 
require the assistance of an I/S professional. Users should be able to 
perform analysis in a manner consistent with their professions. For 
example, a financial analyst might use terms such as "calculate the yield 
to maturity," or "what is the internal rate of return"; an accountant might 
use terms such as "net profit," "earnings per share" or "pro-forma" ; a 
sales manager might make a request to "compare the sales performance of 
each salesperson across all regions" or "compare the last 12 months of 
sales of the new product line with the sales of the original line," etc. 

Evaluation Considerations 

When selecting a data warehouse product or vendor, the planning team 
should develop evaluation and clii:ter^i:a~wieTg^l% before evaluating and 
selecting any product or vendor. These include: 

* Economic 

* Required function 

* Usability 

* Desired level of performance 

* Scalability 

* Openness 

* Hardware/software platform environment: 

* relative advantages of mainframes, client/server, PC 

* use existing software or install new? 

* Vendor considerations: 

* best-of -breed components or single vendor? 

* vendor viability and support (education, consulting). 

A simple 1 to 5 evaluation scale is usually used. Identifying all the 
relative factors results in better accuracy, which is more important than 
precision. The criteria and weights should reflect your specific business 
environment and requirements. Any composite score should be used to 
indicate the relative, not absolute, evaluation. 

An OLAP system is usually on a separate hardware system from the OLTP 
system, so it does not impact the operational and tactical needs of the 
organization. 

An important consideration is to decide on the hardware and software 
platforms. The primary focus of this article is on products that execute 
either on an MPP or mainframe, or cooperate with a mainframe in a 
client/server environment. Of less interest are those products that execute 
only on stand-alone personal computers or Intelligent Work Stations 
(IWSes). Often, the extracted data and/or interim results are transferred 
to an IWS for subsequent analysis by the user. 

Table 2 is a summary of the strengths of various hardware and software 
platforms. These considerations now will be "sliced and diced" and "drilled 
down . " 

Economic considerations should include the explicit hardware, software 
and network, support costs and benefits, and the hidden or intangible 



costs. Many firms do not adequately consider the intangible economic costs, 
such as lost productivity, that occur when each user must act as a system 
administrator. Recent studies suggest that the total cost of a PC-centric 
or client/server system is much higher than initially thought. In some 
instances, it is five times greater than the cost of a mainframe-based 
system. The intangible benefits include the value of better, more timely 
information. 

Price is a consideration, but more important is the relative value or 
ratio of achievable benefits to price. Users realize the importance of 
timely, useful information and are willing to pay for value received. The 
total cost of the hardware, prerequisite and OLAP software and the 
opportunity cost of late or inaccurate information should be included in 
the justification of the system. 

Users should be provided with choices regarding the level of accuracy, 
desired precision, format of the query, statistical analysis techniques, 
level of sophistication of the model, resulting output and format. Examples 
include choices regarding graphical displays (line, bar, stack, pie, 
three-dimensional), colors and shading, equations, textual reports and 
spreadsheets. 

Accuracy is not the same as precision; it is more important that users 
be provided with results that are accurate enough for the intended use than 
that they be extremely precise. For example, when preparing next year's 
budget, it is preferable that the manager know that the average salary is 
approximately $40,000 than to be told that it is $41,123.48, plus or minus 
$10,000. 

Simplicity and intuitive end-user access and usability are key 
requirements for all users. Minimally, the products should provide a GUI 
with windows, menus, icons, tool bars, etc., and use nonprocedural SQL 
language. 

Because the product may operate across multiple hardware/software 
platforms, the user interface should be consistent across all supported 
environments to minimize any education and retraining. Using the product 
should be intuitive and self-explanatory. There should be on-line, 
context-driven help that allows the user to obtain information and guidance 
regarding a topic by placing the cursor on the subject. By striking a key, 
a "pop-down" window appears with increasing levels of explanation of the 
topic. The user does not have to key in the subject; the system 
automatically provides information for the topic indicated by the cursor. 

As with OLTP, the response time for OLAP depends on the hardware and 
software platforms, the nature of the analysis and the load on the system. 
The objective for OLAP can be several orders of magnitude longer than for 
OLTP. If the normal OLTP response time on a platform is n seconds, the 
following guidelines for the OLAP response time are proposed: 

* Simple drill down and summarization -5 to lOn 

* Statistical and regression analysis, mathematical analysis (e.g., 
curve fitting) - 10 to 50 n 

* Artificial Intelligence (AI) - depending on the technique, this 
could take many minutes. In most instances, the AI tool will initiate a 
task to be executed in a batch partition. Scalability is the property that 
provides support for additional users, larger databases and higher 
performance by adding more computer resources (more storage capacity, 
processing power, terminals, etc.) without changing the fundamental 
operating environment, application or operating procedures. Doubling the 
power should allow doubling the number of users with no degradation in 
throughput or performance; doubling resources should provide the same 
number of users with twice the performance. 

Scalability can be achieved vertically (adding resources to a single 



processor) or horizontally (by enabling multiple processors to 
cooperatively operate and share the workload transparently). Typically, 
systems are scaled using: 

* Larger uniprocessors 

* Tightly coupled multiprocessors sharing memory and DASD 

* A loosely coupled multiprocessor that may share DASD but have 
separate processors or main memory 

* MPPs with hundreds of processors. 

Openness is the ability to interface with other vendors 1 products and 
other hardware and software platforms. The data should be importable from 
and exportable to the popular word processing, database and spreadsheet 
products . 

In a concurrent multiuser access and usability environment, data 
integrity, data security and privacy are critically important. The 
organization needs to balance usability and accessibility with the need to 
protect the fundamentally valuable corporate data asset against 
unauthorized access. Recent articles have discussed instances where 
businesses incurred billions of dollars in lost or damaged data because of 
inadequate data access security protection against malicious actions, fraud 
and computer viruses. 

When building a data warehouse, you may select products from the same 
vendor (or from its business partners) or you may select "best-of-breed" 
products from several vendors and integrate them into a cohesive system. 
With the latter, you can be the systems integrator or rely on one of the 
many consulting firms that offer integration, installation and turnkey 
services. If data mining system products are selected from multiple 
vendors, then they must be integrated to work together (not a trivial 
task). 

The "staying power" and viability of a vendor are as important as the 
functional characteristics of the product. Recently, there have been 
numerous consolidations, restructurings and mergers in I/S. This has 
resulted in many software products being stabilized or no longer being 
marketed and supported. Many of the most innovative products are provided 
by smaller vendors. If you are to make the investment in data warehouse 
software and tools, you need assurance that the products will continue to 
be supported, enhanced and improved. This commitment is demonstrated by 
code quality, frequency of releases, the existence of user groups and the 
vendor's ability to provide technical, educational, installation and 
international service support. Other possible indicators include the 
vendor's level of sales and number of technical employees, years in 
business, sales offices, customer installations, etc. 

Conclusion 

When implementing a data warehouse project, you need to first define 
the objectives and decision criteria before evaluating the alternatives and 
developing the plan. 

Table 1 

Typical User Characteristics 

Characteristic      Power User        Executives           Casual Users 
Level of Function   Sophisticated     Basic                Basic 
Response Time       Medium            Quick                Medium to Low 
Processing Demand   High              Medium to Low        Low 
Usability Needs     Low               High                 Medium 
Accuracy            High              High                 High 
Precision           High              Medium to Low        High 
Openness            High              Medium to Low        Medium to Low 
Analysis            Ad hoc, diverse   Summary, strategic,  Basic, preplanned 
                    detailed          some detail 

Table 2 

Summary Of Relative Strengths Of Hardware/Software Platforms 

                             MPP        Mainframe  Client/Server  PC 
Data Storage Capacity, I/O   High       High       Med            Low 
Access-Usability             Low        Low        High           High 
Presentation                 Low        Low        High           High 
I/S Services                 High       High       Med            Low 
# Users                      Very High  High       Med            Low 
Analysis                     High       High       High           High 
Scalability                  High       High       Med            Low 

MPP and mainframes provide better storage and I/O 

PCs provide better access, usability and presentation 

Emil T. Cipolla has more than 30 years of experience in developing 
large-scale mainframe information systems. He can be reached at 
102127.2451@compuserve.com. 
ABSTRACT: Online analytical processing systems deliver information for 
unexpected and unstructured situations. Analysis processing can involve 
simple or complex questions as well as required analysis from various 
angles. Data mining's analysis processing is multidimensional and performed 
on-the-fly. It allows different departments to gather small to large 
amounts of data in as honed a fashion as is desired. All OLAP and data 
mining tools are customizable by size and specificity for searches. OLAP 
and data mining products that go beyond basic analytical capabilities move 
into the realm of mathematical and statistical routines for trend analysis 
or curve fitting. Knowledge-Based Systems-enabled (KBS) data mining tools 
can discover patterns in data that are otherwise not readily apparent. Some 
products display results in graphical format, others use scales or factors. 
Statistical techniques are most relevant when specific variables exist and 
their relationships are under investigation. 

TEXT: 

Many firms have available a vast amount of operational, historical and 
external data from industry or government sources. Increasingly, this data 
is stored in a data warehouse and managed and accessed differently from the 
operational system. Some firms suffer from data overload, or too much data 
to effectively analyze. Techniques are being used to sift through this vast 
amount of data and determine the relevant information necessary to answer 
the questions at hand. 

Operational systems provide answers to well-structured, predefined 
situations such as sales, inventory control, accounts receivable, accounts 
payable and general ledger systems. Batch and On-Line Transactional 
Processing (OLTP) programs are developed to satisfy these structured 
(although possibly complex) events. 

Separate systems are necessary to enable you to respond to situations that 
are not predefined. Unlike OLTP systems, On-Line Analytical Processing 
(OLAP) systems provide information for unstructured and unanticipated 
situations. Typically, an analyst reviews a large amount of current or 
historical data in an exploratory, unplanned or ad hoc manner to detect 
some pattern or trend. 

There are many aspects of data mining and OLAP; this article focuses 
on the use of more sophisticated techniques in analysis processing. 

Analysis Processing 

The questions and the required analysis may be simple or complex and 
may be analyzed from many different perspectives. For example, a sales 
manager may analyze sales by customer, listing customers with the highest 
total amount of sales, the lowest, the average amount and the total sales 
by product line, time period, geographic region, etc. A product manager 
may analyze the same data by product line, a comptroller by time period, a 
regional manager by geography, a development manager by feature, etc. 

Data mining is the process of analyzing this data from all these 
perspectives in an ad hoc, multidimensional manner. Data mining allows 
different individuals to retrieve as much or as little data as they need. 
They can retrieve current detail data, historical data or even external 
data and summarize it by any desired category. All data mining and OLAP 
products provide a basic analysis capability (minimum, maximum, average) 
and the ability to drill down to obtain more detail and summarize details 
as necessary. 
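
The basic capability described above (summarize by any category, report 
minimum, maximum and average, then drill down) can be illustrated with a 
short Python sketch. The sales records, field names and function names are 
hypothetical and stand in for whatever the warehouse actually holds. 

```python
from collections import defaultdict

# Hypothetical sales records: (customer, product_line, region, amount)
FIELDS = ("customer", "product_line", "region", "amount")
SALES = [
    ("Acme", "widgets", "East", 1200.0),
    ("Acme", "gadgets", "East", 300.0),
    ("Bolt", "widgets", "West", 800.0),
    ("Cord", "gadgets", "West", 500.0),
    ("Bolt", "gadgets", "East", 200.0),
]


def summarize(records, dimension):
    """Group amounts by one dimension; return min, max, average and total."""
    index = FIELDS.index(dimension)
    groups = defaultdict(list)
    for record in records:
        groups[record[index]].append(record[3])
    return {
        key: {
            "min": min(amounts),
            "max": max(amounts),
            "avg": sum(amounts) / len(amounts),
            "total": sum(amounts),
        }
        for key, amounts in groups.items()
    }


def drill_down(records, dimension, value):
    """Return the detail records behind one summarized group."""
    index = FIELDS.index(dimension)
    return [record for record in records if record[index] == value]


if __name__ == "__main__":
    # The same data viewed by a regional manager and by a product manager.
    print(summarize(SALES, "region"))
    print(summarize(SALES, "product_line"))
    print(drill_down(SALES, "region", "East"))
```

The same records serve every perspective; only the grouping dimension 
changes from one user to the next. 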

Some data mining and OLAP products go beyond basic analysis 
capabilities and provide statistical and mathematical routines to 
calculate the coefficients and powers of prespecified independent 
variables. This is known as curve fitting or trend analysis. Trend analysis 
is used to determine patterns and relationships, and if key measurements 
are still within limits and expectations. 

It is not difficult to determine patterns when the variables are 
known. Mathematical techniques are available to determine the relationship 
of one dependent variable (such as sales of a particular model) to several 
independent variables (such as the selling price, competition, number of 
sales channels, sales of complementary products, inflation rate, etc.). The 
resulting mathematical expression may be complex and have several 
independent terms with coefficients raised to different powers. Note that 
while the mathematics may be complex and require a large amount of system 
resources, the analyst must first specify the dependent and independent 
variables before the program can determine the coefficients and powers. 
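
As a minimal sketch of curve fitting with prespecified variables, the 
following Python code fits a straight line by ordinary least squares. It 
covers only the simplest case (one independent variable, first power); the 
variable names and the price and unit figures are hypothetical. 

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = intercept + slope * x."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need at least two paired observations")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("x values must not all be equal")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


if __name__ == "__main__":
    # The analyst specifies the variables; the program finds the coefficients.
    prices = [10.0, 12.0, 14.0, 16.0]       # independent variable
    units = [200.0, 180.0, 160.0, 140.0]    # dependent variable
    intercept, slope = fit_line(prices, units)
    print(f"units = {intercept:.1f} + ({slope:.1f}) * price")
```

Note that the analyst chose price as the independent variable and units 
sold as the dependent variable; the code only determines the coefficients. 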

However, the difficulty lies in identifying key patterns and trends when 
the analyst does not know the independent variables or, for that matter, 
may not even know what dependent variable should be analyzed. This is where 
more sophisticated and powerful Knowledge-Based System (KBS) techniques can 
be useful. 

Knowledge-Based Systems 

A data mining product with KBS capabilities can detect patterns in 
data that are not readily apparent; that is, it can indicate that there is 
a relationship between one dependent variable and one or more independent 
variables. Some products utilize data visualization techniques to 
graphically present these relationships. The analyst can change the scale, 
display format and factors to better represent the relationships. The 
relationship need not be a causal one, nor is it likely to be readily 
apparent. 

For example, KBS can determine that when the July sales of chocolate 
ice cream in a supermarket with more than 50,000 square feet of space are 
more than 500 gallons, then the November sales of red wine in the adjacent 
specialty store will be more than 10 cases. 

The key point is that statistical techniques are appropriate when the 
variables are specified by the analyst beforehand, and their relationship 
must be determined. However, the more sophisticated systems analyze the 
data not only to determine the relationship, but also to identify the 
relevant variables. These techniques first calculate the coefficients 
and variables and develop the rules and patterns. Then they may analyze 
additional data and modify the rules by changing the coefficients or even 
by adding or deleting independent variables without needing the analyst's 
judgment and intervention. 
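
Once a rule such as the ice cream and red wine example has been proposed, 
one simple way to check it against the data is to count how often it holds. 
The Python sketch below computes the support and confidence of that rule, 
two common association-rule measures; the article does not say which 
measures any particular product uses, and the observations are invented 
for illustration. 

```python
# Hypothetical store observations:
# (store_sq_ft, july_ice_cream_gallons, november_wine_cases)
OBSERVATIONS = [
    (60000, 650, 14),
    (55000, 520, 12),
    (70000, 480, 6),
    (52000, 700, 9),
    (40000, 900, 15),
    (65000, 510, 11),
]


def antecedent(obs):
    """IF part: large supermarket with strong July ice cream sales."""
    sq_ft, ice_cream_gallons, _ = obs
    return sq_ft > 50000 and ice_cream_gallons > 500


def consequent(obs):
    """THEN part: more than 10 cases of red wine sold in November."""
    return obs[2] > 10


def rule_support_confidence(observations):
    """Return (support, confidence) of the rule over the observations."""
    matches = [o for o in observations if antecedent(o)]
    hits = [o for o in matches if consequent(o)]
    support = len(hits) / len(observations) if observations else 0.0
    confidence = len(hits) / len(matches) if matches else 0.0
    return support, confidence


if __name__ == "__main__":
    support, confidence = rule_support_confidence(OBSERVATIONS)
    print(f"support={support:.2f} confidence={confidence:.2f}")
```

A high confidence says the pattern is reliable in the data; it says nothing 
about causation. 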

Some Knowledge-Based Techniques 

Several KBS technologies are of interest. The technologies can be 
described as being either qualitative reasoning and model-based, 
quantitative analysis and rule-based, or from some other perspective. 

Qualitative vs. Quantitative 

Models are used in many data processing and data mining applications 
and often are imprecise. They purposefully simplify the assumptions, facts 
and problem-solving process, hoping the results will still adequately 
represent the real world. The objective is to develop a model at the lowest 
cost that completely depicts the problem, is accurate and precise, but is 
easy to use. Ideally, the model is created from as small a sample as 
possible and can successfully and reliably predict new cases. Often, very 
adequate results are obtained through the use of relatively simple rules. 

Model-based KBS systems explicitly state the fundamental dependent 
relationships among the variables. Model-based systems generate, test and 
evaluate the solution for consistency; feedback controls are used to 
constrain the acceptable solution. Heuristics (discussed later in this 
article) are used to terminate the search when it seems that the most 
likely alternatives have been identified and evaluated. Qualitative models 
are used for situations involving vague conditions, complicated 
interactions among the factors and inequalities. Qualitative models often 
provide a range of acceptable solutions and differing interpretations. 

Quantitative methods are usually more specific and precise. KBS solves 
complex problems by applying facts and rules that mimic human reasoning. 
Facts express a truth concerning some entity. Rules are relationships among 
facts about certain situations. An example in programming is the IF ... 
THEN ... ELSE logic; that is, IF fact 'A' is true, THEN perform action 'X'; 
ELSE perform 'Y'. 

The rules may be simple (as in the definition of profit being equal to 
income minus expense) or complex (as in the calculation of federal tax 
liability). Some rule-based systems screen the alternatives and courses of 
action, rank them and select the highest-scoring candidate. Often, 
rule-based models hide the underlying assumptions, do not allow for 
judgment and exceptions or are inflexible. They usually are impractical to 
code for all combinations and normally cannot handle subtleties. 
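
A rule-based system that screens alternatives, ranks them and selects the 
highest-scoring candidate might look like the following Python sketch. The 
alternatives, the facts about them and the point values are all 
hypothetical; only the profit rule (income minus expense) comes from the 
text. 

```python
# Facts about each alternative (hypothetical values for illustration).
ALTERNATIVES = {
    "option_a": {"income": 500.0, "expense": 350.0, "within_budget": True},
    "option_b": {"income": 400.0, "expense": 380.0, "within_budget": True},
    "option_c": {"income": 600.0, "expense": 450.0, "within_budget": False},
}


def profit(facts):
    """Simple rule from the text: profit equals income minus expense."""
    return facts["income"] - facts["expense"]


# Each rule reads: IF condition(facts) THEN add points ELSE add nothing.
RULES = [
    (lambda facts: profit(facts) > 0, 1),
    (lambda facts: profit(facts) >= 100, 2),
    (lambda facts: facts["within_budget"], 1),
]


def score(facts):
    """Sum the points of every rule whose condition is true."""
    return sum(points for condition, points in RULES if condition(facts))


def rank(alternatives):
    """Return alternative names from highest to lowest score."""
    return sorted(alternatives, key=lambda name: score(alternatives[name]),
                  reverse=True)


if __name__ == "__main__":
    ranking = rank(ALTERNATIVES)
    print("ranking:", ranking)
    print("selected:", ranking[0])
```

The sketch also shows the weakness noted above: the point values embody 
assumptions that are invisible to the user of the ranking. 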

Expert Systems 

Expert systems use predefined logical "rules" to analyze a problem. 
Typically, application experts are interviewed to capture their 
problem-solving approaches. Their thoughts and decision-making processes 
are then codified into a set of rules that is applied to new situations. 
The relationships and rules for all steps are well-known and are 
methodically followed without modification. An example of an expert system 
is the creation of a sales invoice. The unit price is multiplied by the 
number of units, then the discounts and sales taxes are applied to arrive 
at the total cost. 
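
The invoice example can be written as a fixed sequence of rules. In this 
Python sketch the function name, the rates and the order of operations 
(discount first, then sales tax) are assumptions; actual invoicing rules 
vary by firm and jurisdiction. 

```python
from decimal import Decimal, ROUND_HALF_UP


def invoice_total(unit_price, units, discount_rate, tax_rate):
    """Apply the invoice rules in a fixed order and return the total.

    Assumption: the discount is applied before the sales tax.
    Monetary values and rates are passed as strings to avoid
    binary floating-point rounding surprises.
    """
    subtotal = Decimal(unit_price) * units
    discounted = subtotal * (Decimal(1) - Decimal(discount_rate))
    total = discounted * (Decimal(1) + Decimal(tax_rate))
    return total.quantize(Decimal("0.01"), rounding=ROUND_HALF_UP)


if __name__ == "__main__":
    # 3 units at 19.99, 10 percent discount, 8 percent sales tax
    print(invoice_total("19.99", 3, "0.10", "0.08"))
```

Every step is known in advance and followed without modification, which is 
what distinguishes this kind of system from the heuristic systems discussed 
later. 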

An example of a rule is the shortest rule. Pruning is a technique of 
continually deleting components of the model and comparing the results to 
the previous results. If the pruned performance is equal to or better than 
that of the unpruned model, then the pruned model is used. An equation may 
consist of several independent variables, each raised to a high-order 
power. You then reduce the number of variables and/or reduce the power 
until the result is no longer better than the previous result. The pruning 
technique is appropriate in many commercial and business applications. 
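
The pruning loop described above can be sketched as follows. This Python 
example makes several simplifying assumptions: the model is linear, its 
coefficients are fixed rather than refit after each deletion, and 
performance is measured as mean squared error on a small hypothetical data 
set. The names are illustrative only. 

```python
def mean_squared_error(terms, data):
    """Average squared error of a linear model given as {variable: coefficient}."""
    total = 0.0
    for features, actual in data:
        predicted = sum(coef * features[name] for name, coef in terms.items())
        total += (predicted - actual) ** 2
    return total / len(data)


def prune(terms, data):
    """Delete terms one at a time while the error is equal to or better than before."""
    current = dict(terms)
    current_error = mean_squared_error(current, data)
    improved = True
    while improved and len(current) > 1:
        improved = False
        for name in list(current):
            candidate = {k: v for k, v in current.items() if k != name}
            candidate_error = mean_squared_error(candidate, data)
            if candidate_error <= current_error:
                # The pruned model is at least as good, so keep it.
                current, current_error = candidate, candidate_error
                improved = True
                break
    return current


# Hypothetical model and data: sales depend on price only,
# but the unpruned model also carries a spurious weekday term.
TERMS = {"price": 2.0, "weekday": 0.5}
DATA = [
    ({"price": 1.0, "weekday": 1.0}, 2.0),
    ({"price": 2.0, "weekday": -1.0}, 4.0),
    ({"price": 3.0, "weekday": 2.0}, 6.0),
]

if __name__ == "__main__":
    print(prune(TERMS, DATA))
```

A production tool would normally refit the remaining coefficients after 
each deletion and evaluate on data held out from the fitting step. 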

These rules are usually employed in one of two ways. 

* Forward chaining (or data-driven) -- This method starts with a set 
of initial facts and evaluates rules to generate new facts until the goal 
is achieved. For example, when developing a personal budget, you start with 
facts stating income and expenses. KBS applies the rules appropriate to the 
level of mandatory expenditures. These lead to new facts regarding 
categories of allowable expenditures for food, housing, transportation, 
etc. The result is a completed budget. Often, the KBS model does not have 
rules that cover all situations; note that there were no rules regarding 
the acceptable level of borrowing. This omission is usually handled in KBS 
systems by using heuristic techniques. 



* Backward chaining (or goal-driven) -- This search method starts with 
a goal and continually evaluates rules to establish subgoals. In program 
debugging, the goal is to find and correct the cause of the error. If there 
is a compile error, you find and correct the cause and recompile the 
program. 

Where expert systems start with the rules and apply them to the 
data, other more sophisticated tools are able to analyze the data, derive 
and modify the rules and apply them to additional situations (e.g., "learn" 
from new data). These are called heuristic systems. 
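
The two chaining strategies described above can be contrasted in a small 
Python sketch. The rule format (a set of required facts plus one 
conclusion) and the budget fact names are hypothetical; real KBS tools use 
much richer rule languages. 

```python
# Each rule: (set of required facts, fact that can then be concluded).
RULES = [
    ({"income_known", "expenses_known"}, "mandatory_spending_set"),
    ({"mandatory_spending_set"}, "food_allowance_set"),
    ({"mandatory_spending_set"}, "housing_allowance_set"),
    ({"food_allowance_set", "housing_allowance_set"}, "budget_complete"),
]


def forward_chain(facts, rules, goal):
    """Data-driven: fire rules to add new facts until the goal appears."""
    known = set(facts)
    changed = True
    while changed and goal not in known:
        changed = False
        for conditions, conclusion in rules:
            if conditions <= known and conclusion not in known:
                known.add(conclusion)
                changed = True
    return goal in known


def backward_chain(goal, facts, rules, path=frozenset()):
    """Goal-driven: prove the goal by proving the subgoals of a matching rule."""
    if goal in facts:
        return True
    if goal in path:  # guard against circular rules
        return False
    path = path | {goal}
    for conditions, conclusion in rules:
        if conclusion == goal and all(
            backward_chain(subgoal, facts, rules, path) for subgoal in conditions
        ):
            return True
    return False


if __name__ == "__main__":
    facts = {"income_known", "expenses_known"}
    print(forward_chain(facts, RULES, "budget_complete"))
    print(backward_chain("budget_complete", facts, RULES))
```

Both functions reach the same answer; they differ in whether the search 
starts from the known facts or from the goal. 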

Heuristic Systems 

Heuristics is a Greek term meaning to discover and be self-learning. 
In heuristics, there is a reasonable confidence that a series of steps will 
lead to a successful solution. The steps are followed and modified until 
the desired results are achieved. In programming, inductive systems begin 
with a proposed answer and then modify it as new facts are learned and 
evaluated. Program debugging, automobile and medical diagnosis, and 
detective work are often heuristic processes. Some heuristic technologies 
include the following. 

* Fuzzy logic and probabilistic reasoning -- techniques for writing rules 
that handle the vagueness and imprecision of many business concepts or 
partially true data (such as new vs. old, teenager vs. adult buying 
patterns, fast vs. slow, etc.). The rules may be either sophisticated or 
simple and can deal with subtleties. 

* Nearest neighbor -- an attempt is made to identify the existing subset 
of the data population that most closely resembles the characteristics of 
the subset under evaluation. The closest neighbors share many similar 
attributes but are different enough to be in separate groups. 

* Neural networks -- a network of many simple processors (units). Each 
processor may have many inputs and outputs, and each performs a simple 
computation to produce the output. The computation may be different for 
each of the units. Neural networks are useful for the solution of 
pattern-recognition and data-filtering problems. 
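
Of the techniques listed above, nearest neighbor is the easiest to show in 
a few lines. The Python sketch below assigns a new customer to the segment 
of the closest known customer, using straight-line distance over two 
attributes. The customers, attributes and segment labels are hypothetical, 
and a real tool would normally scale the attributes first. 

```python
import math

# Hypothetical known customers: ((age, purchases_per_year), segment)
KNOWN = [
    ((17, 4), "teenager"),
    ((19, 6), "teenager"),
    ((35, 20), "adult"),
    ((42, 25), "adult"),
]


def nearest_neighbor(point, labeled_points):
    """Return the label of the known point closest to `point`."""
    _, label = min(labeled_points, key=lambda item: math.dist(point, item[0]))
    return label


if __name__ == "__main__":
    print(nearest_neighbor((18, 5), KNOWN))
    print(nearest_neighbor((38, 22), KNOWN))
```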

Scoring, Verification And Discovery 

Another perspective is to categorize the KBS techniques as either 
scoring, verification or discovery. 

In scoring, the attributes or characteristics of a finite number of 
subgroups are identified. Examples might be age, sex, marital status, 
income, educational level, etc. Each attribute is assigned a numerical 
value. Then, each new instance is assigned to one of the subgroups by 
scoring the attributes. This technique works well for a small number of 
attributes but requires expert detailed knowledge of the model. Since it is 
knowledge-intensive and can be time-consuming, its use is usually limited to 
business professionals rather than executives. 
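
A minimal Python sketch of the scoring technique follows. The attributes, 
point values, subgroup names and score ranges are all invented for 
illustration; in practice they come from someone with detailed knowledge of 
the model, which is exactly the limitation noted above. 

```python
# Hypothetical point values for each attribute value.
POINTS = {
    "age": {"under_30": 1, "30_to_50": 2, "over_50": 3},
    "income": {"low": 1, "medium": 2, "high": 3},
    "education": {"high_school": 1, "college": 2, "graduate": 3},
}

# Hypothetical subgroups, each defined by an inclusive score range.
SUBGROUPS = [("basic", 3, 4), ("standard", 5, 7), ("premium", 8, 9)]


def score_instance(instance):
    """Sum the points for every attribute of the instance."""
    return sum(POINTS[attr][value] for attr, value in instance.items())


def assign_subgroup(instance):
    """Assign a new instance to the subgroup whose range contains its score."""
    total = score_instance(instance)
    for name, low, high in SUBGROUPS:
        if low <= total <= high:
            return name
    raise ValueError(f"score {total} falls outside every subgroup")


if __name__ == "__main__":
    customer = {"age": "30_to_50", "income": "high", "education": "graduate"}
    print(score_instance(customer), assign_subgroup(customer))
```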

In verification, the data is iteratively analyzed by issuing a series 
of simple queries. The interim results are sliced and diced repetitively 
until the desired level of accuracy, precision and usability is obtained 
(or you become exhausted!). 

In discovery, rules are created concerning the data (this is 
essentially the KBS approach, in which the system evaluates the data 
characteristics). 

Many of these technologies have been available in the research and 
academic environments for years; recently, comprehensive KBS development 
tools have become available to the commercial information processing arena. 
These tools, which are available in several hardware/software environments, 
have enabled leading-edge I/S organizations to develop KBS-based 
applications. The tools contain a rich set of built-in functions, editors 
and interfaces to external databases. Some of these applications use the 
building blocks available from vendors; others are internally developed by 
companies for a unique application or to preserve their competitive edge. 

Typical KBS Applications 

Where are the most appropriate applications of KBS in OLAP and data 
mining? Typically, the solution involves judgment and a set of complex 
interrelated facts and rules, and the solution often need not be exact or 
unique. 

Accounting data is analyzed beyond the traditional profit-and-loss and 
balance sheet reports to determine the contribution to net profits by 
product line, region or cost center. Product managers might set the desired 
profit objectives and then determine the required amount of sales for 
different combinations of product line, brand, sales channel or region 
necessary to achieve these objectives. 

In finance, expert systems are used to state the relationships between 
a fundamental financial instrument (such as a Treasury Bill) and financial 
derivatives to predict the changes in market valuation between them. In 
another example, an expert system is used to estimate the accuracy of 
financial statements; the model develops expectations regarding what the 
financial results should be. To the extent that these expectations are not 
met, further investigation (drilling down of the data) is performed. 
In addition, major banks use KBS to analyze the financial markets and audit 
foreign exchange transactions. 
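
The financial-statement check described above (compare results to 
expectations, then drill down where they differ) can be sketched in Python 
as follows. The line items, the figures and the 5 percent tolerance are 
assumptions for the example; the article does not specify how the 
expectations are derived or what deviation triggers investigation. 

```python
def flag_for_drill_down(expected, actual, tolerance=0.05):
    """Return the line items whose actual value misses expectation.

    A line item is flagged when its relative deviation from the expected
    value exceeds `tolerance` (an assumed threshold).
    """
    flagged = []
    for item, expected_value in expected.items():
        actual_value = actual[item]
        if expected_value == 0:
            deviates = actual_value != 0
        else:
            deviation = abs(actual_value - expected_value) / abs(expected_value)
            deviates = deviation > tolerance
        if deviates:
            flagged.append(item)
    return flagged


if __name__ == "__main__":
    expected = {"revenue": 1000.0, "cost_of_sales": 600.0, "travel": 50.0}
    actual = {"revenue": 1020.0, "cost_of_sales": 690.0, "travel": 50.0}
    print(flag_for_drill_down(expected, actual))
```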

Sales data is compared to past periods and to the sales of 
competitors. A typical question might be: "What was the impact of the 
increase in advertising on the level of sales of the product, and how does 
that compare to the level of sales resulting from last year's sales 
promotion campaign?" 

In marketing, the relevant market is segmented into homogeneous 
subgroups that share similar demographics (age, sex, ethnicity, marital 
status, etc.), geographic location or industry classification. The 
objective is to determine the factors that distinguish the buying patterns 
of the members of the subgroup from those in other subgroups. 

Direct marketers analyze their databases to gain a competitive 
marketing advantage. One firm has a database containing more than 100 
million records of past customer purchases that it uses to send targeted 
promotional offerings based on the customer's purchase history. It can 
design and execute precise marketing programs with predictable and 
measurable results. 

Tobacco firms use a database of smokers' names and addresses to send 
coupons and free samples directly to consumers. The firms also can identify 
likely buyers of their products or send promotional offers to consumers of 
rival brands. 

A major video rental firm uses its records of more than 40 million 
households to achieve several objectives. First, it can recommend movie 
rentals to individual customers based on past rentals; this results in 
improved service. Second, it can reduce its overall inventory level and 
costs by stocking only the most popular titles. The resulting cost savings 
can be passed on to customers through lower prices, or the firm can 
increase profits. 

A major credit card firm prints targeted promotional offers next to 
the invoice lines on monthly statements. If the customer's invoices show a 
pattern of charges for airline trips to England, the billing statement may 
include a note of an upcoming sale at Harrods department store in London. 
KBS is also used in credit card authorizations to evaluate margin credit 
accounts and in fraud detection. 

In I/S, many organizations use KBS technologies to help automate the 
help desk, diagnose system problems, improve database design and system 
capacity utilization and efficiency, and automate operations. 



Conclusion 

The preceding are only a few examples of how KBS technologies are used 
in OLAP and data mining to solve business problems. 

It is the effective use of KBS technologies for the OLAP data analysis 
function that distinguishes the relatively few leading-edge data mining 
products from those that only offer data drilling, consolidation and 
slicing-and-dicing functions. 

Emil T. Cipolla, M.S.M.E. and M.B.A., has more than 30 years of 
experience in developing large-scale mainframe information systems. He can 
be reached at 102127.2451@compuserve.com. 

Ed Boden is a development programmer with IBM with 15 years of 
experience. He has developed operating systems and networking and system 
management software on a variety of platforms. He can be reached at 
102103.152@compuserve.com. 
DOCUMENT-IDENTIFIER: US 5508733 A 

TITLE: Method and apparatus for selectively receiving and storing a plurality of 
video signals 

Brief Summary Text (42): 

Channel availability has been a crucial limitation in the broadcasting industry. 
Channel allocation has been very valuable and expensive. It has precluded several 
interested individuals, small businesses, consumers, and local community chapters 
from accessing the TV broadcasting networks, in order to express personal views or 
to advertise. 

Brief Summary Text (79): 

Four-tube (luminance-channel) cameras were then introduced when color receivers 
served a small fraction of the audience. The viewer of a color program in monochrome 
became aware of a lack of sharpness. Using a high-resolution luminance channel to 
provide the brightness component in conjunction with three chrominance channels for 
the Red (R), Green (G) and Blue (B) components produced images that were sharp and 
independent of registry errors. 

Brief Summary Text (109): 

Nowadays, small businesses and individuals find it quite prohibitive to advertise 
and/or to express their views in conventional publications, such as newspapers. As 
the cost of printed publications rises with the continuing decrease of natural 
resources, it will become even more forbidding for individuals and small businesses 
to retain even the limited access to printed publications they now enjoy. This 
problem will become a major concern in the near future, as it will very subtly 
become an indirect restraint on the constitutional freedom of speech. 

Brief Summary Text (165): 

51. U.S. Pat. No. 5,099,319, to Esch et al. generally discloses an apparatus having 
a central site and a remote site for customizing advertising for television using a 
video signal comprising a communication channel, and video and communications 
processors. The video processor mixes the first content data signal with the video 
signal. A cue processor generates insertion signals. 

Brief Summary Text (168): 

The Esch et al. U.S. Pat. No. 5,099,319 generally describes a video information 
delivery apparatus for customizing advertising for television. As exemplified by 
claim 2, the apparatus includes a studio processor and storage, for generating and 
storing content data signals. A schedule-processor is responsive to the content 
data signals for generating a schedule data signal. A network processor generates 
a communications data signal, and a transmitter transmits the communications 
signal. A control processor coordinates the operation of the studio-processor, 
schedule-processor and network processor. 

Detailed Description Text (305): 

Cable television systems in the United States carry an average of 35 channels of 
diversified programming services. Higher capacity systems are currently being 
designed to 80 channels (550 MHz) on a single coaxial cable. Commercial program 
insertion systems, such as spot advertising, cross-channel promotional, barker 
insertions and network non-duplication have evolved somewhat independently in cable 
systems, and it would be desirable to integrate these program insertion systems 
within the cable television network. 

Detailed Description Text (309): 

The VAD mapping system could also be used by the advertising agencies to reserve 
their spots, similarly to the reservation network used by travel agents. 

Detailed Description Text (320): 

The multiplexed signals could be stored in storage 243 for several purposes, such 
as for later transmission to the end users or to other stations, according to an 
established schedule. 



Detailed Description Text (333): 

The transmission station 204A includes a computer 53 which is the central control 
unit for the signal samplers 206, 208, 210; the compressors 216, 218, 220; the 
multiplexer 222; the storage unit 242; and the selectors 275 and 275A. In the 
preferred embodiment, the selector 275 is used to control the multiplexing and 
transmission of selected channels, while the selector 275A is used to control the 
initial reception of incoming channels (1 through n). Thus, if the computer 53 
determines that only a certain number of channels (i.e. 1 and 2) have been 
selected, via the selectors 275 and 275A, then it can either disable the operation 
of the nonfunctional samplers (i.e. 210); or, in the alternative, it could use 
them to assist in alleviating the traffic on congested circuits. In this manner, 
the operation of the transmission station 204A is optimized. 

Detailed Description Text (372): 

The present invention can also be used in video-audio-data mail applications, where 
a sender of information can leave encoded video, audio and/or data (VAD) messages, 
on a recorder, such as a conventional video recorder. When these VAD messages are 
to be retrieved, they are demultiplexed, demodulated and decoded according to the 
above teachings. The present video modulation system has several military 
applications in that it allows the encoding of video, audio and data signals in a 
format that unauthorized users cannot decode. 